Steve Elston
10/13/2022
The concepts of likelihood and maximum likelihood estimation (MLE) have been at the core of statistical modeling for about 100 years
In the 21st century, likelihood and MLE ideas continue to be foundational
Understanding the concept of likelihood and the use of MLE methods is key to understanding many parametric statistical methods
Likelihood is a measure of how well a model fits data
MLE is a generic method for parameter estimation
MLE is used widely for machine learning models, including some deep learning models
Statistical inference seeks to characterize the uncertainty in statistical point estimates
Statistics are estimates of population parameters
Inferences using statistics must consider the uncertainty in the estimates
Confidence intervals quantify uncertainty in statistical estimates
Nonparametric bootstrap estimation is widely useful and requires minimal assumptions
The bootstrap distribution consists of values of the statistic computed from bootstrap resamples of the original observations (the data sample)
Computing the bootstrap distribution requires no assumptions about the population distribution!
Bootstrap resampling estimates the bootstrap distribution of a statistic
There are several variations of the basic nonparametric bootstrap algorithm
Resampling methods are general and powerful, but there is no magic involved! There are pitfalls!
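The basic nonparametric bootstrap algorithm can be sketched in a few lines of Python with NumPy alone; the synthetic data, the statistic (the median), and the number of resamples below are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=100)  # synthetic observations

n_boot = 2000
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    # Resample the data with replacement, same size as the original sample
    resample = rng.choice(data, size=data.size, replace=True)
    boot_medians[i] = np.median(resample)

# boot_medians is the bootstrap distribution of the median; a 95%
# percentile interval quantifies the uncertainty of the point estimate
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(data):.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

No assumption about the population distribution is made anywhere in this loop; only the observed sample is resampled.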
Despite their long history, Bayesian models were not used extensively until recently
Bayesian analysis stands in contrast to frequentist methods
The objective of Bayesian analysis is to compute a posterior distribution
In contrast, frequentist statistics computes a point estimate and confidence interval from a sample
Bayesian models allow expressing prior information in the form of a prior distribution
Selection of prior distributions can be performed in a number of ways
The posterior distribution is said to quantify our current belief
We update beliefs based on additional data or evidence
This is a critical difference from frequentist models, which must be computed from a complete sample
Inference can be performed on the posterior distribution by finding the maximum a posteriori (MAP) value and a credible interval
Bayesian methods made global headlines with the successful location of the missing Air France Flight 447
The aircraft had disappeared in a little-traveled area of the Atlantic Ocean
Conventional search methods had failed to locate the wreckage; the potential search area was too large
Bayesian methods rapidly narrowed the prospective search area
Posterior distribution of locations of Air France 447
With greater computational power and general acceptance, Bayes methods are now widely used
Among pragmatists
Bayes models allow us to express prior information
Models that fall between these extremes are also in common use
Can compare the contrasting frequentist and Bayesian approaches
Comparison of frequentist and Bayes methods
Bayes’ Theorem is fundamental to Bayesian data analysis.
\[P(A \cap B) = P(A|B) P(B) \]
We can also write:
\[P(A \cap B) = P(B|A) P(A) \]
Eliminating \(P(A \cap B)\):
\[ P(B)P(A|B) = P(A)P(B|A)\]
And finally, Bayes theorem!
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
Bayes Theorem!
In many cases we are interested in the marginal distribution
\[p(\theta_1) = \int_{\theta_2, \ldots, \theta_n} p(\theta_1, \theta_2, \ldots, \theta_n)\ d\theta_2, \ldots, d\theta_n\]
But computing this integral is not easy!
For discrete distributions compute the marginal by summation
Or, for discrete samples of a continuous distribution
Example: we know the posterior distribution of a parameter \(\theta\) but really want the marginal distribution of the parameter value
\[ p(\theta) = \sum_{x \in \mathbf{X}} p(\theta |\mathbf{X})\ p(\mathbf{X}) \]
Now we have the marginal distribution of \(\theta\)
Or, we need to find the denominator for Bayes theorem to normalize our posterior distribution:
\[ p(\mathbf{X}) = \sum_{\theta \in \Theta} p(\mathbf{X} |\theta) p(\theta) \]
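These summations are easy to carry out numerically. A minimal sketch with an illustrative three-valued discrete parameter (the probabilities are made up for the example, not from the text):

```python
import numpy as np

# Illustrative discrete prior p(theta) over three parameter values
theta_prior = np.array([0.2, 0.5, 0.3])
# Illustrative likelihood p(X | theta) of the observed data for each value
likelihood = np.array([0.10, 0.40, 0.70])

# Denominator of Bayes' theorem: p(X) = sum over theta of p(X|theta) p(theta)
p_X = np.sum(likelihood * theta_prior)

# Normalized posterior over theta; sums to 1.0 by construction
posterior = likelihood * theta_prior / p_X
```

The same summation that produces the denominator \(p(\mathbf{X})\) is what normalizes the posterior.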
How can you interpret Bayes’ Theorem?
\[Posterior\ Distribution = \frac{Likelihood \bullet Prior\ Distribution}{Evidence} \]
\[ posterior\ distribution(parameters\ |\ data) = \\ \frac{Likelihood(data\ |\ parameters)\ Prior(parameters)}{P(data)} \]
\[ P(parameters\ |\ data) = \frac{P(data\ |\ parameters)\ P(parameters)}{P(data)} \]
What do these terms actually mean?
Posterior distribution of the parameters given the evidence or data, the goal of Bayesian analysis
The prior distribution is chosen to express information available about the model parameters a priori
Likelihood is the conditional distribution of the data given the model parameters
The data, or evidence, is the distribution of the data and normalizes the posterior
These relationships can apply to all the parameters in a model: partial slopes, intercepts, error distributions, lasso constants, etc.
We need a tractable formulation of Bayes Theorem for computational problems
\[ P(B \cap A) = P(B|A)P(A) \\ and \\ P(B) = P(B \cap A) + P(B \cap \bar{A}) \]
Where \(\bar{A} = not\ A\), and the marginal distribution, \(P(B)\), can be written:
\[ P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A}) \]
Using the foregoing relations we can rewrite Bayes' Theorem as:
\[ P(A|B) = \frac{P(A)P(B|A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})} \]
Absorbing the denominator into a normalization constant \(k\), we can also rewrite Bayes' Theorem as:
\[P(A|B) = k \cdot P(B|A)P(A)\]
Ignoring the normalization constant \(k\):
\[P(A|B) \propto P(B|A)P(A)\]
Denominator must account for all possible outcomes, or alternative hypotheses, \(h'\):
\[Posterior(hypothesis\ |\ evidence) =\\ \frac{Likelihood(evidence\ |\ hypothesis)\ prior(hypothesis)}{\sum_{ h' \in\ All\ possible\ hypotheses}Likelihood(evidence\ |\ h')\ prior(h')}\]
This is a formidable problem!
Hemophilia is a serious genetic condition carried on the X chromosome
As evidence, the woman has two sons (not identical twins), neither of whom expresses hemophilia
What is the likelihood for the two sons \(X = (x_1,x_2)\) not having hemophilia?
Two possible cases here
\[ p(x_1=0, x_2=0 | \theta = 1) = 0.5 * 0.5 = 0.25 \\ p(x_1=0, x_2=0 | \theta = 0) = 1.0 * 1.0 = 1.0 \]
Note: we are neglecting the possibility of a mutation in one of the sons
Use Bayes' theorem to compute the probability that the woman carries an X chromosome with hemophilia expression, \(\theta = 1\), with prior \(p(\theta=1) = 0.5\)
\[ p(\theta=1 | X) = \frac{p(X|\theta=1) p(\theta=1)}{p(X|\theta=1) p(\theta=1) + p(X|\theta=0) p(\theta=0)} \\ = \frac{0.25 * 0.5}{0.25 * 0.5 + 1.0 * 0.5} = 0.20 \]
The evidence of two sons without hemophilia causes us to update our belief: the probability that the woman is a carrier falls from the prior of 0.5 to 0.20
Note: The denominator is the sum over all possible hypotheses, the marginal distribution of the observations \(\mathbf{X}\)
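The arithmetic of this example is easy to verify in a few lines of Python, using exactly the quantities given above:

```python
# Prior probabilities that the woman is / is not a carrier
p_theta1 = 0.5                 # prior: carrier, theta = 1
p_theta0 = 0.5                 # prior: not a carrier, theta = 0

# Likelihood of two unaffected sons under each hypothesis
lik_theta1 = 0.5 * 0.5         # p(x1=0, x2=0 | theta=1)
lik_theta0 = 1.0 * 1.0         # p(x1=0, x2=0 | theta=0)

# Denominator: marginal probability of the evidence over both hypotheses
p_X = lik_theta1 * p_theta1 + lik_theta0 * p_theta0

# Posterior probability the woman is a carrier
posterior_carrier = lik_theta1 * p_theta1 / p_X
print(posterior_carrier)  # 0.2
```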
How do we interpret the foregoing relationship?
\[Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\ or \\ P(parameters\ |\ data) \propto P(data\ |\ parameters)\ P(parameters) \]
We can find an unnormalized function proportional to the posterior distribution
Sum over the function to find the marginal distribution \(P(B)\)
This approach can transform an intractable computation into a simple summation
The goal of a Bayesian analysis is computing and performing inference on the posterior distribution of the model parameters
The general steps are as follows:
Identify data relevant to the research question
Define a sampling plan for the data. Data need not be collected in a single batch
Define the model and the likelihood function; e.g. regression model with Normal likelihood
Specify a prior distribution of the model parameters
Use the Bayesian inference formula to compute posterior distribution of the model parameters
Update the posterior as data is observed
Inference on the posterior can be performed; compute credible intervals
Optionally, simulate data values from realizations of the posterior distribution. These values are predictions from the model.
An advantage of a Bayesian model is that it can be updated as new observations are made
In contrast, for frequentist models data must be collected completely in advance
We update our belief by adding new evidence
The posterior of a Bayesian model with no evidence is the prior
The previous posterior serves as a prior for model updates
The choice of the prior is a difficult, and potentially vexing, problem when performing Bayesian analysis
The need to choose a prior has often been cited as a reason why Bayesian models are impractical
The general guidance, that a prior must be convincing to a skeptical audience, is vague in practice
Some possible approaches include:
Use prior empirical information
Apply domain knowledge to determine a reasonable distribution
If there is poor prior knowledge of the problem, a non-informative prior can be used
How to use prior empirical information to estimate the parameters of the prior distribution
An analytically and computationally simple choice for a prior distribution family is a conjugate prior
When a likelihood function is multiplied by its conjugate prior, the posterior distribution is in the same family as the prior
Attractive idea for cases where the conjugate distribution exists
But there are many practical cases where a conjugate prior is not used
Most commonly used distributions have conjugates, with a few examples:
| Likelihood | Conjugate |
|---|---|
| Binomial | Beta |
| Bernoulli | Beta |
| Poisson | Gamma |
| Categorical | Dirichlet |
| Normal - mean | Normal |
| Normal - variance | Inverse Gamma |
| Normal - inverse variance, \(\tau\) | Gamma |
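Conjugacy in the first row of the table can be checked numerically: the normalized product of a Binomial likelihood and a Beta prior matches a Beta density with updated parameters. A minimal sketch, with illustrative parameter values not taken from the text:

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0      # illustrative Beta prior parameters
z, n = 7, 20         # illustrative evidence: z successes in n trials

# Evaluate likelihood * prior on a fine grid over theta in (0, 1)
theta = np.linspace(0.001, 0.999, 999)
step = theta[1] - theta[0]
unnormalized = binom.pmf(z, n, theta) * beta.pdf(theta, a, b)
# Normalize so the grid values approximate a density
posterior = unnormalized / (unnormalized.sum() * step)

# Conjugacy: the posterior is the Beta(z + a, n - z + b) density
analytic = beta.pdf(theta, z + a, n - z + b)
```

The grid posterior and the analytic Beta density agree to within the grid's numerical error.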
We are interested in analyzing the incidence of distracted drivers
Randomly sample the behavior of 10 drivers at an intersection and determine if they exhibit distracted driving or not
Data are Binomially distributed, a driver is distracted or not, with likelihood:
\[ P(k) = \binom{n}{k} \cdot \theta^k(1-\theta)^{n-k}\]
The conjugate prior for this Binomial process is the Beta distribution
What are the properties of the Beta distribution?
Beta distribution for different parameter values
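For reference, the Beta density is defined on \([0,1]\) with shape parameters \(a, b > 0\):
\[ Beta(\theta\ |\ a, b) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a, b)} \]
Its mean is \(a/(a+b)\) and, for \(a, b > 1\), its mode is \((a-1)/(a+b-2)\); larger \(a + b\) concentrates the distribution.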
Consider the product of a Binomial likelihood and a Beta prior
Define the evidence as \(n\) trials with \(z\) successes
Prior is a Beta distribution with parameters \(a\) and \(b\), or the vector \(\theta = (a,b)\)
From Bayes Theorem the distribution of the posterior:
\[\begin{align} posterior(\theta | z, n) &= \frac{likelihood(z,n | \theta)\ prior(\theta)}{data\ distribution (z,n)} \\ p(\theta | z, n) &= \frac{Binomial(z,n | \theta)\ Beta(\theta)}{p(z,n)} \\ &= Beta(z + a,\ n-z+b) \end{align}\]
There are some useful insights you can gain from this relationship:
\[ posterior(\theta | z, n) = Beta(z + a,\ n-z+b) \]
Posterior distribution is in the Beta family, as a result of conjugacy
Parameters \(a\) and \(b\) are determined by the prior and the evidence
Parameters of the prior can be interpreted as pseudo counts of successes, \(a = pseudo\ successes + 1\), and failures, \(b = pseudo\ failures + 1\)
- Evidence is also in the form of (actual) counts of successes, \(z\), and failures, \(n-z\)
- The more evidence, the greater the influence on the posterior distribution
- A large amount of evidence will overwhelm the prior
- With a large amount of evidence, the posterior converges to the frequentist model
Consider an example with:
- Prior pseudo counts \([1,9]\): successes \(a = 1 + 1\) and failures \(b = 9 + 1\)
- Evidence: successes \(= 10\) and failures \(= 30\)
- Posterior is \(Beta(10 + 2,\ 40 - 10 + 10) = Beta(12,\ 40)\)
Prior, likelihood and posterior for distracted driving
How can we find an estimate of the posterior distribution?
We can sample from the analytic solution - if we have a conjugate
We can sample the likelihood and prior, take the product and normalize - for any posterior
Grid sample or Markov chain Monte Carlo (MCMC) sample
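The grid-sampling option can be sketched in Python for a single parameter, assuming the Binomial likelihood and Beta prior of the running example (the grid size is an illustrative choice):

```python
import numpy as np
from scipy.stats import beta, binom

def compute_posterior(z, n, a, b, n_grid=1000):
    """Grid-sample the unnormalized posterior, then normalize."""
    # 1. Build the sampling grid over theta's support (0, 1)
    grid = np.linspace(0.0005, 0.9995, n_grid)
    # 2. Unnormalized posterior: likelihood * prior at each grid point
    unnormalized = binom.pmf(z, n, grid) * beta.pdf(grid, a, b)
    # 3. Normalize so the sampled posterior sums to 1.0
    return grid, unnormalized / unnormalized.sum()

# Distracted-driving example: 10 successes in 40 trials, Beta(2, 10) prior
grid, posterior = compute_posterior(z=10, n=40, a=2, b=10)
```

The normalized grid values approximate the posterior without any conjugacy assumption; only the product of likelihood and prior is needed.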
Grid sampling is a naive approach
Sampling grid for bivariate distribution
Algorithm for grid sampling to compute posterior from likelihood and prior
Procedure CreateGrid(variables, lower_limits, upper_limits):
    # Build the sampling grid
    return sampling_grid

Procedure SampleLikelihood(sampling_value, observation_values):
    return likelihood_function(sampling_value, observation_values)

Procedure Prior(sampling_value, prior_parameter_values):
    return prior_density_function(sampling_value, prior_parameter_values)

Procedure ComputePosterior(variables, lower_limits, upper_limits, observation_values, prior_parameter_values):
    # Initialize the sampling grid
    grid = CreateGrid(variables, lower_limits, upper_limits)
    # Initialize array to hold sampled posterior values
    array posterior[range(grid)]
    # Compute the unnormalized posterior at each sampling value in the grid
    for sampling_value in grid:
        likelihood = SampleLikelihood(sampling_value, observation_values)
        prior = Prior(sampling_value, prior_parameter_values)
        posterior[sampling_value] = likelihood * prior
    # Normalize the posterior so it sums to 1.0
    probability_data = sum(posterior[range(grid)])
    posterior = posterior[range(grid)] / probability_data
    return posterior

How can we specify the uncertainty for a Bayesian parameter estimate?
For frequentist analysis we use confidence intervals, but these are not entirely appropriate for Bayesian estimates
For Bayesian analysis, inference is performed on the posterior distribution
Example, the \(\alpha = 0.90\) credible interval encompasses the 90% of the posterior distribution with the highest density
The credible interval is sometimes called the highest density interval (HDI), or highest posterior density interval (HPDI)
For symmetric distributions the credible interval can be numerically the same as the confidence interval
What is the 95% credible interval for \(Beta(12,\ 40)\)?
Probability of distracted drivers for the next 10 cars
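A credible interval can be computed directly from the posterior. The sketch below searches for the shortest interval containing 95% of a Beta posterior's probability mass, using only `scipy.stats.beta`; the search resolution is an illustrative choice:

```python
import numpy as np
from scipy.stats import beta

def beta_hdi(a, b, mass=0.95, n_search=10_000):
    """Shortest interval containing the given probability mass of Beta(a, b)."""
    # Candidate intervals [ppf(p), ppf(p + mass)] for lower-tail probabilities p
    lower_tail = np.linspace(0.0, 1.0 - mass, n_search)
    lows = beta.ppf(lower_tail, a, b)
    highs = beta.ppf(lower_tail + mass, a, b)
    i = np.argmin(highs - lows)   # the HDI is the shortest such interval
    return lows[i], highs[i]

# 95% HDI for the distracted-driving posterior
low, high = beta_hdi(12, 40)
```

Every candidate interval contains 95% of the posterior by construction; the HDI is simply the shortest one, which for a unimodal density is also the highest-density one.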
How are credible intervals different from the more familiar confidence intervals?
Confidence intervals and credible intervals are conceptually quite different
A confidence interval is a purely frequentist concept
- It is an interval on the sampling distribution where repeated samples of a statistic are expected with probability \(= \alpha\)
- A confidence interval cannot be interpreted as an interval on a probability distribution of the value of a statistic!
A credible interval is an interval on the posterior distribution of the statistic
- The credible interval is exactly what the misinterpretation of the confidence interval tries to be
- The credible interval is the interval of probability \(\alpha\) with the highest density for the statistic being estimated
Compare confidence interval and credible interval for the case of 10 observations
Credible intervals cross the density function at exactly the same density
Confidence intervals have equal CDF tail probabilities beyond each end of the interval
Difference between credible and confidence intervals
What else can we do with a Bayesian posterior distribution beyond credible intervals?
Perform simulations and make predictions
Predictions are computed by simulating from the posterior distribution
Results of these simulations are useful for several purposes, including:
Example: What are the probabilities of distracted drivers for the next 10 cars with posterior \(Beta(12,\ 40)\)?
Probability of distracted drivers for the next 10 cars
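Such predictions are simulated by drawing \(\theta\) from the posterior and then counts from the Binomial. A sketch assuming the conjugate posterior of the running example (random seed and number of simulations are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(123)
a_post, b_post = 12, 40           # posterior from the distracted-driving example
n_next = 10                       # cars in the next batch
n_sims = 100_000

# Draw theta from the posterior, then a count from Binomial(10, theta)
theta_draws = rng.beta(a_post, b_post, size=n_sims)
predicted = rng.binomial(n_next, theta_draws)

# Posterior predictive distribution over 0..10 distracted drivers
pred_probs = np.bincount(predicted, minlength=n_next + 1) / n_sims
```

Unlike a plug-in Binomial prediction, this posterior predictive distribution propagates the remaining uncertainty in \(\theta\) into the predicted counts.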
Bayesian analysis stands in contrast to frequentist methods
The objective of Bayesian analysis is to compute a posterior distribution
In contrast, frequentist statistics computes a point estimate and confidence interval from a sample
Bayesian models allow expressing prior information in the form of a prior distribution
Selection of prior distributions can be performed in a number of ways
The posterior distribution is said to quantify our current belief
We update beliefs based on additional data or evidence
This is a critical difference from frequentist models, which must be computed from a complete sample
Inference can be performed on the posterior distribution by finding the maximum a posteriori (MAP) value and a credible interval
Predictions are made by simulating from the posterior distribution